Towards Energy Efficiency in Heterogeneous Processors: Findings on Virtual Screening Methods
The integration of the latest breakthroughs in computational modeling and high performance computing (HPC) has driven advances in healthcare and drug discovery, among other fields. By combining these developments, scientists are creating new personalized therapeutic strategies for longer lives that were unimaginable not long ago. At the same time, HPC is undergoing its biggest revolution of the last decade. Several graphics processing unit (GPU) architectures have established their niche in the HPC arena, but at the expense of excessive power consumption and heat. Heterogeneity offers a solution to this problem. In this paper, we analyze power consumption on heterogeneous systems by benchmarking a bioinformatics kernel used in virtual screening methods. Cores and frequencies are tuned to further improve performance or energy efficiency on those architectures. Our experimental results show that the targeted low-cost systems have the lowest power consumption, whereas the most energy-efficient platform, and the one best suited for performance improvement, is Nvidia's Kepler GK110 GPU programmed with CUDA (Compute Unified Device Architecture). Finally, the OpenCL (Open Computing Language) version of the virtual screening kernel shows a remarkable performance penalty compared with its CUDA counterpart.
Ingeniería, Industria y Construcció
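The distinction between lowest power and best energy efficiency can be illustrated with a minimal sketch. It assumes energy is simply average power multiplied by runtime; the platform names and all numbers are hypothetical and are not taken from the paper.

```python
# Hypothetical platforms and measurements, for illustration only.
platforms = {
    "low_power_board": {"avg_power_w": 10.0, "runtime_s": 400.0},
    "discrete_gpu": {"avg_power_w": 150.0, "runtime_s": 8.0},
}


def energy_joules(avg_power_w, runtime_s):
    """Energy to solution, assuming constant average power over the run."""
    return avg_power_w * runtime_s


# Energy per platform: power (W) x time (s) = energy (J).
energy = {name: energy_joules(**p) for name, p in platforms.items()}

# The platform drawing the least power is not necessarily the one
# that spends the least energy to finish the job.
lowest_power = min(platforms, key=lambda n: platforms[n]["avg_power_w"])
most_efficient = min(energy, key=energy.get)

print(lowest_power, most_efficient, energy)
```

With these made-up figures the low-power board draws far less power but runs so much longer that it consumes more energy overall, which is the kind of trade-off the abstract reports.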
P systems simulations on massively parallel architectures
Membrane Computing is an emergent research area studying
the behaviour of living cells to define bio-inspired computing
devices, also called P systems. Such devices provide
polynomial time solutions to NP-complete problems by
trading space for time. The efficient simulation of P systems
poses challenges in three different aspects: the intrinsic
massive parallelism of P systems, an exponential computational
workspace, and a non-intensive floating point nature.
In this paper, we analyze the simulation of a family of recognizer
P systems with active membranes that solves the Satisfiability
(SAT) problem in linear time on three different architectures:
a shared memory system, a distributed memory
system, and a set of Graphics Processing Units (GPUs). For
an efficient handling of the exponential workspace created by
the P systems computation, we enable different data policies
on those architectures to increase memory bandwidth
and exploit data locality through tiling. Parallelism inherent
to the target P system is also managed on each architecture
to demonstrate that GPUs offer a valid alternative for
high-performance computing at a considerably lower cost:
Considering the largest problem size we were able to run
on the three parallel platforms involving four processors,
execution times were 20049.70 ms using OpenMP on the
shared memory multiprocessor, 4954.03 ms using MPI on
the distributed memory multiprocessor and 565.56 ms using
CUDA on our four GPUs, which gives the GPUs speed-up factors of
35.44x and 8.75x over OpenMP and MPI, respectively.
Fundación Séneca 00001/CS/2007; Ministerio de Ciencia e Innovación TIN2009–13192; European Community CSD2006-00046; Junta de Andalucía P06-TIC-02109; Junta de Andalucía P08–TIC-0420
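The reported speed-up factors can be checked directly from the three execution times quoted in the abstract. The sketch below assumes speed-up is defined as baseline time divided by accelerated time; the variable and function names are illustrative.

```python
# Execution times in milliseconds, as quoted in the abstract.
times_ms = {"OpenMP": 20049.70, "MPI": 4954.03, "CUDA": 565.56}


def speedup(baseline_ms, accelerated_ms):
    """Speed-up factor: how many times faster the accelerated run is."""
    return baseline_ms / accelerated_ms


cuda_vs_openmp = speedup(times_ms["OpenMP"], times_ms["CUDA"])
cuda_vs_mpi = speedup(times_ms["MPI"], times_ms["CUDA"])

print(f"CUDA vs OpenMP: {cuda_vs_openmp:.2f}x")
print(f"CUDA vs MPI:    {cuda_vs_mpi:.2f}x")
```

The quotients come out at roughly 35.45 and 8.76, which agree with the reported 35.44x and 8.75x up to the last digit (the published values appear to be truncated rather than rounded).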
The GPU on the simulation of cellular computing models
Membrane Computing is a discipline aiming to
abstract formal computing models, called membrane systems
or P systems, from the structure and functioning of the living
cells as well as from the cooperation of cells in tissues,
organs, and other higher order structures. This framework
provides polynomial time solutions to NP-complete problems
by trading space for time, but its efficient simulation
poses challenges in three different aspects: the intrinsic
massive parallelism of P systems, an exponential computational
workspace, and a non-intensive floating point nature.
In this paper, we analyze the simulation of a family of recognizer
P systems with active membranes that solves the
Satisfiability problem in linear time on different instances of
Graphics Processing Units (GPUs). For an efficient handling
of the exponential workspace created by the P systems
computation, we enable different data policies to increase
memory bandwidth and exploit data locality through tiling
and dynamic queues. Parallelism inherent to the target P
system is also managed to demonstrate that GPUs offer a
valid alternative for high-performance computing at a considerably
lower cost. Furthermore, scalability is demonstrated
on the way to the largest problem size we were able to
run, and considering the new hardware generation from
Nvidia, Fermi, for a total speed-up exceeding four orders of
magnitude when running our simulations on the Tesla S2050
server.
Agencia Regional de Ciencia y Tecnología - Murcia 00001/CS/2007; Ministerio de Ciencia e Innovación TIN2009–13192; Ministerio de Ciencia e Innovación TIN2009-14475-C04; European Commission Consolider Ingenio-2010 CSD2006-0004
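To give a feel for the "exponential workspace" mentioned in these abstracts, here is a minimal sequential sketch. It is not the authors' simulator and does not model membranes or rules; it only assumes that a space-for-time approach amounts to materializing all 2^n truth assignments and checking each one independently, which is the part that maps naturally onto parallel hardware. The clause encoding (positive integer k for variable k, negative for its negation) is a convention chosen for this example.

```python
from itertools import product


def workspace_size(n_vars):
    """Number of candidate assignments that must be held or visited."""
    return 2 ** n_vars


def satisfies(assignment, clauses):
    """True if every clause has at least one literal made true."""
    return all(
        any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def satisfiable(clauses, n_vars):
    """Brute-force SAT check over all 2**n_vars assignments.

    Each assignment is checked independently of the others, so the
    loop body is the unit of work a parallel simulator would distribute.
    """
    for assignment in product([False, True], repeat=n_vars):
        if satisfies(assignment, clauses):
            return True
    return False


# (x1 or x2) and (not x1 or x2) is satisfiable; x1 and (not x1) is not.
print(satisfiable([[1, 2], [-1, 2]], 2))
print(satisfiable([[1], [-1]], 1))
print(workspace_size(20))
```

Each extra variable doubles the number of assignments, which is why memory bandwidth and data locality dominate the simulation cost.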
Comparative evaluation of platforms for parallel Ant Colony Optimization
The rapidly growing field of nature-inspired computing concerns the development and application of algorithms and methods based on biological or physical principles. This approach is particularly compelling for practitioners in high-performance computing, as natural algorithms are often inherently parallel (for example, they may be based on a “swarm”-like model that uses a population of agents to optimize a function). Coupled with rising interest in nature-based algorithms is the growth in heterogeneous computing, that is, systems that use more than one kind of processor. We are therefore interested in the performance characteristics of nature-inspired algorithms on a number of different platforms. To this end, we present a new OpenCL-based implementation of the Ant Colony Optimization algorithm, and use it as the basis of extensive experimental tests. We benchmark the algorithm against existing implementations on a wide variety of hardware platforms, and offer extensive analysis. This work provides rigorous foundations for future investigations of Ant Colony Optimization on high-performance platforms.
Dynamic load balancing on heterogeneous clusters for parallel ant colony optimization
© 2016 Springer Science+Business Media New York. Ant colony optimisation (ACO) is a nature-inspired, population-based metaheuristic that has been used to solve a wide variety of computationally hard problems. In order to take full advantage of the inherently stochastic and distributed nature of the method, we describe a parallelization strategy that leverages these features on heterogeneous, large-scale, massively parallel hardware systems. Our approach balances workload effectively by dynamically assigning jobs to heterogeneous resources, which then run ACO implementations using different search strategies. Our experimental results confirm that we can obtain significant improvements in terms of both solution quality and energy expenditure, thus opening up new possibilities for the development of metaheuristic-based solutions to “real world” problems on high-performance, energy-efficient contemporary heterogeneous computing platforms.
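The abstract does not describe the scheduler itself, so the following is only a generic sketch of dynamic job assignment, not the authors' method. It assumes a simple greedy policy (each job goes to whichever worker becomes idle first) and compares it with a static round-robin split; worker speeds and job costs are hypothetical.

```python
import heapq


def dynamic_schedule(job_costs, worker_speeds):
    """Give each job to the worker that becomes idle first.

    Returns the makespan and the number of jobs each worker ran.
    """
    n = len(worker_speeds)
    idle_at = [(0.0, w) for w in range(n)]  # (time worker is free, worker id)
    heapq.heapify(idle_at)
    finish = [0.0] * n
    counts = [0] * n
    for cost in job_costs:
        free_time, w = heapq.heappop(idle_at)
        done = free_time + cost / worker_speeds[w]
        finish[w] = done
        counts[w] += 1
        heapq.heappush(idle_at, (done, w))
    return max(finish), counts


def static_schedule(job_costs, worker_speeds):
    """Round-robin split that ignores worker speed; returns the makespan."""
    n = len(worker_speeds)
    finish = [0.0] * n
    for i, cost in enumerate(job_costs):
        w = i % n
        finish[w] += cost / worker_speeds[w]
    return max(finish)


# Hypothetical setup: ten unit-cost jobs, one slow worker and one 4x faster.
jobs = [1.0] * 10
speeds = [1.0, 4.0]
dyn_makespan, dyn_counts = dynamic_schedule(jobs, speeds)
static_makespan = static_schedule(jobs, speeds)
print(dyn_makespan, dyn_counts, static_makespan)
```

In this toy setup the dynamic policy routes most jobs to the faster worker and finishes well before the static split, in which the slow worker becomes the bottleneck.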
Exploiting Kepler Capabilities on Zernike Moments
This work analyzes the most advanced features of
the Kepler GPU by Nvidia, mainly dynamic parallelism for
launching kernels internally from the GPU and thread scheduling
via Hyper-Q. We illustrate several ways to exploit those features
from a code which computes Zernike moments, using two
different formulations: direct and iterative. This way, we compare
how well they can deploy parallelism on the new generation of
GPUs. The direct alternative tries to maximize parallelism, while
the iterative one increases the operational intensity by reusing
results coming from previous iterations. This has allowed us
to increase the speed-up factor attained on Fermi architectures
versus a code written in C and executed on a multicore CPU. We
also succeed in identifying the critical workload which is required
by a code to improve its execution on the new GPU platforms
endowed with six times more computational cores, and quantify
the overhead introduced by the new dynamic programming
mechanisms in CUD
Enhancing GPU parallelism in nature-inspired algorithms
We present GPU implementations of two different nature-inspired optimization methods for well-known optimization problems. Ant Colony Optimization (ACO) is a two-stage population-based method modelled on the foraging behaviour of ants, while P systems provide a high-level computational modelling framework that combines the structure and dynamic aspects of biological systems (in particular, their parallel and non-deterministic nature). Our methods focus on exploiting data parallelism and the memory hierarchy to obtain GPU speed-up factors surpassing 20x for each of the two stages of the ACO algorithm, and 16x for P systems, when compared to sequential versions running on a single-threaded high-end CPU. Additionally, we compare performance between GPU generations to validate hardware enhancements introduced by Nvidia’s Fermi architecture.
Enhancing data parallelism for Ant Colony Optimization on GPUs
Graphics Processing Units (GPUs) have evolved into highly parallel and fully programmable architectures over the past five years, and the advent of CUDA has facilitated their use in many real-world applications. In this paper, we deal with a GPU implementation of Ant Colony Optimization (ACO), a population-based optimization method which comprises two major stages: tour construction and pheromone update. Because of its inherently parallel nature, ACO is well-suited to GPU implementation, but it also poses significant challenges due to irregular memory access patterns. Our contribution within this context is threefold: (1) a data parallelism scheme for tour construction tailored to GPUs, (2) novel GPU programming strategies for the pheromone update stage, and (3) a new mechanism called I-Roulette to replicate the classic roulette wheel while improving GPU parallelism. Our implementation leads to speed-up factors exceeding 20x for each of the two stages of the ACO algorithm as applied to the TSP when compared to its sequential counterpart running on a similar single-threaded high-end CPU. Moreover, an extensive discussion focused on different implementation paths on GPUs shows the way to deal with parallel graph connected components. This, in turn, suggests a broader area of inquiry, where algorithm designers may learn to adapt similar optimization methods to GPU architectures.
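For readers unfamiliar with roulette wheel selection, a minimal sequential sketch follows. The first function is the classic scheme, whose running sum is inherently serial. The abstract does not say how I-Roulette works, so the second function is only an assumed, parallel-friendly variant written for illustration: every candidate draws its own random number, scales its weight by it, and the largest score wins, which maps onto a per-thread multiply plus a max-reduction. Function names and weights are hypothetical.

```python
import random


def roulette_select(weights, rng):
    """Classic roulette wheel: pick index i with probability w_i / sum(w)."""
    threshold = rng.random() * sum(weights)
    running = 0.0
    for i, w in enumerate(weights):
        running += w  # sequential prefix sum: hard to parallelize directly
        if threshold < running:
            return i
    return len(weights) - 1  # guard against floating-point rounding


def independent_select(weights, rng):
    """Assumed parallel-friendly variant (not confirmed by the source).

    Each candidate scores weight * its own random draw; the maximum wins.
    Zero-weight candidates (e.g. already visited cities) score zero.
    """
    scores = [w * rng.random() for w in weights]
    return max(range(len(weights)), key=scores.__getitem__)


# Hypothetical weights: candidates 0 and 2 are excluded (weight zero).
weights = [0.0, 2.0, 0.0, 1.0]
rng = random.Random(0)
classic_picks = [roulette_select(weights, rng) for _ in range(1000)]
independent_picks = [independent_select(weights, rng) for _ in range(1000)]
print(classic_picks.count(1), independent_picks.count(1))
```

The two schemes are not statistically identical: with weights 2 and 1, the classic wheel picks the heavier candidate with probability 2/3, while the independent-draw variant picks it with probability 3/4. Both favour heavier candidates and never select a zero-weight one.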